Motivation: The Mapping Problem

[Figure labels: memory hierarchy; intricate math; intricate memory accesses (indexing)]
Approach

Mathematics of Arrays

• Math and indexing operations in same expression

• Framework for design space search

- Rigorous and provably correct

- Extensible to complex architectures

Example: "raising" array dimensionality

x: < 0 1 2 ... 35 >
Basic Idea

Benefits

• Theory based

• High level API

• Efficient

[Figure labels: Implementation; Theory]

Combining Expression Templates and Psi Calculus yields an optimal implementation of array operations
Psi Calculus: Key Concepts

Some Psi Calculus Operations

Operation     Arguments                            Definition
take          Vector A, int N                      Forms a Vector of the first N elements of A
drop          Vector A, int N                      Forms a Vector of the last (A.size-N) elements of A
rotate        Vector A, int N                      Forms a Vector of the last N elements of A
                                                   concatenated to the other elements of A
cat           Vector A, Vector B                   Forms a Vector that is the concatenation of A and B
unaryOmega    Operation Op, dimension D,           Applies unary operator Op to D-dimensional
              Array A                              components of A (like a for all loop)
binaryOmega   Operation Op, Dimension Adim,        Applies binary operator Op to Adim-dimensional
              Array A, Dimension Bdim, Array B     components of A and Bdim-dimensional components
                                                   of B (like a for all loop)
reshape       Vector A, Vector B                   Reshapes B into an array having A.size dimensions,
                                                   where the length in each dimension is given by the
                                                   corresponding element of A
iota          int N                                Forms a vector of size N, containing values 0 .. N-1

Legend: index permutation; operators; restructuring; index generation
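The one-dimensional operations in the table can be sketched eagerly on std::vector, following the table's argument order. This is only an illustration of the definitions, not the presentation's implementation: the namespace psi_sketch, the alias Vec, and the precondition that N does not exceed A.size are assumptions made here.

```cpp
#include <cstddef>
#include <vector>

namespace psi_sketch {  // hypothetical namespace, used only for this illustration

using Vec = std::vector<int>;

// take: the first n elements of a (assumes n <= a.size()).
Vec take(const Vec& a, std::size_t n) {
    return Vec(a.begin(), a.begin() + n);
}

// drop: the last (a.size() - n) elements of a (assumes n <= a.size()).
Vec drop(const Vec& a, std::size_t n) {
    return Vec(a.begin() + n, a.end());
}

// rotate: the last n elements of a, concatenated to the other elements of a.
Vec rotate(const Vec& a, std::size_t n) {
    Vec r(a.end() - n, a.end());
    r.insert(r.end(), a.begin(), a.end() - n);
    return r;
}

// cat: concatenation of a and b.
Vec cat(const Vec& a, const Vec& b) {
    Vec r(a);
    r.insert(r.end(), b.begin(), b.end());
    return r;
}

// iota: vector of size n containing the values 0 .. n-1.
Vec iota(std::size_t n) {
    Vec r(n);
    for (std::size_t i = 0; i < n; ++i) r[i] = static_cast<int>(i);
    return r;
}

}  // namespace psi_sketch
```

Each call returns a freshly allocated vector, so composing these eager versions creates one temporary per operation.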
Convolution: Psi Calculus Decomposition

Definition of y=conv(h,x):

y[n] = \sum_{k=0}^{M-1} h[k] \, x'[n+k]

where x has N elements, h has M elements, 0 ≤ n < N+M-1, and x' is x padded by M-1 zeros on either end

Algorithm and Psi Calculus Decomposition

Initial step
  x=<1 2 3 4>  h=<5 6 7>

Form x'
  x'=cat(reshape(<k-1>, <0>), cat(x, reshape(<k-1>, <0>)))
  x'=<0 0 1 ... 4 0 0>

Rotate x' (N+M-1) times
  x'_rot=binaryOmega(rotate, 0, iota(N+M-1), 1, x')
  x'_rot= <0 0 1 2 ...>
          <0 1 2 3 ...>
          <1 2 3 4 ...>
          ...

Take the "interesting" part of x'_rot
  x'_final=binaryOmega(take, 0, reshape(<N+M-1>, <M>), 1, x'_rot)
  x'_final= <0 0 1>
            <0 1 2>
            <1 2 3>
            ...

Multiply
  Prod=binaryOmega(*, 1, h, 1, x'_final)
  Prod= <0 0 7>
        <0 6 14>
        <5 12 21>
        ...

Sum
  Y=unaryOmega(sum, 1, Prod)
  Y=<7 20 38 ...>

Psi Calculus reduces this to DNF with minimum memory accesses
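A minimal sketch of what the decomposition computes, written as one fused loop nest in place of the separate rotate, take, multiply, and sum steps. It follows the worked values on the slide, where row n of the product is h multiplied elementwise by M consecutive elements of x' starting at position n. The function name conv_sketch is hypothetical, and this is not the DNF the Psi Calculus derivation would produce.

```cpp
#include <cstddef>
#include <vector>

// Computes Y[n] = sum over k of h[k] * x'[n + k], for 0 <= n < N+M-1,
// where x' is x padded with M-1 zeros on either end.
std::vector<int> conv_sketch(const std::vector<int>& h, const std::vector<int>& x) {
    const std::size_t M = h.size();
    const std::size_t N = x.size();
    if (M == 0 || N == 0) return {};

    // Form x': M-1 zeros, then x, then M-1 zeros.
    std::vector<int> xp(N + 2 * (M - 1), 0);
    for (std::size_t i = 0; i < N; ++i) xp[M - 1 + i] = x[i];

    // Rotate/take/multiply/sum collapsed into direct indexing of x'.
    std::vector<int> y(N + M - 1, 0);
    for (std::size_t n = 0; n < y.size(); ++n)
        for (std::size_t k = 0; k < M; ++k)
            y[n] += h[k] * xp[n + k];
    return y;
}
```

With h=<5 6 7> and x=<1 2 3 4>, the first three outputs are 7, 20, and 38, matching the slide.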
Typical C++ Operator Overloading

Example: A=B+C vector add

Main

1. Pass B and C references to operator +

Operator +

2. Create temporary result vector

3. Calculate results, store in temporary

4. Return copy of temporary

5. Pass results reference to operator=

Operator =

6. Perform assignment

2 temporary vectors created

Additional Memory Use

• Static memory

• Dynamic memory (also affects execution time)

Additional Execution Time

• Cache misses/page faults

• Time to create a new vector

• Time to create a copy of a vector

• Time to destruct both temporaries
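The steps above can be sketched with a small eager vector class. NaiveVector is a hypothetical type for illustration only. Whether the returned temporary is actually copied depends on the compiler, since modern compilers often elide or move it, so the sketch shows the pattern without claiming a specific temporary count.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical eager vector type, for illustration only.
struct NaiveVector {
    std::vector<double> data;
    explicit NaiveVector(std::size_t n, double value = 0.0) : data(n, value) {}
};

// Eager operator+: every call allocates and fills a full-size result.
NaiveVector operator+(const NaiveVector& b, const NaiveVector& c) {
    NaiveVector result(b.data.size());            // step 2: create temporary
    for (std::size_t i = 0; i < b.data.size(); ++i)
        result.data[i] = b.data[i] + c.data[i];   // step 3: store in temporary
    return result;                                // step 4: return the temporary
}
```

In an expression such as A = B + C + D, each + produces its own full-size intermediate before the assignment runs, which is the composition cost the slide describes.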

MIT Lincoln Laboratory


C++ Expression Templates and PETE

Expression: A=B+C

Expression type (parse tree built by expression templates):

BinaryNode<OpAdd, Reference<Vector>, Reference<Vector> >

Main

1. Pass B and C references to operator +

Operator +

2. Create expression parse tree

3. Return expression parse tree

4. Pass expression tree reference to operator =

Operator =

5. Calculate result and perform assignment

Parse trees, not vectors, created

Reduced Memory Use

• Parse tree only contains references

Reduced Execution Time

• Better cache use

• Loop fusion style optimization

• Compile-time expression tree manipulation

PETE, the Portable Expression Template Engine, is available from the Advanced Computing Laboratory at Los Alamos National Laboratory

• PETE provides:

- Expression template capability

- Facilities to help navigate and evaluate parse trees

PETE: http://www.acl.lanl.gov/pete
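A hand-rolled sketch of the expression template idea for vector addition. It does not use PETE and does not reproduce PETE's API: the names AddExpr and EtVector are hypothetical, and only operator+ is supported. The point is that operator+ returns a small node holding references, and the single evaluation loop runs inside operator=.

```cpp
#include <cstddef>
#include <vector>

// Parse-tree node for lhs + rhs; it stores references only, no element data.
template <class L, class R>
struct AddExpr {
    const L& lhs;
    const R& rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

// Hypothetical vector type whose assignment evaluates an expression tree.
struct EtVector {
    std::vector<double> data;
    explicit EtVector(std::size_t n, double value = 0.0) : data(n, value) {}
    double operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i) { return data[i]; }
    std::size_t size() const { return data.size(); }

    // One loop evaluates the whole tree; no temporary vectors are created.
    template <class L, class R>
    EtVector& operator=(const AddExpr<L, R>& e) {
        for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
        return *this;
    }
};

// operator+ builds and returns a parse-tree node instead of computing a vector.
inline AddExpr<EtVector, EtVector> operator+(const EtVector& b, const EtVector& c) {
    return {b, c};
}

// Lets an existing tree be extended, as in B + C + D.
template <class L, class R>
AddExpr<AddExpr<L, R>, EtVector> operator+(const AddExpr<L, R>& e, const EtVector& c) {
    return {e, c};
}
```

Because the nodes hold references, the tree must be consumed within the same full expression that builds it, as in A = B + C + D.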
Outline

• Overview

- Motivation

- The Psi Calculus

- Expression Templates

• Implementing the Psi Calculus with Expression Templates

• Experiments

• Future Work and Conclusions


020723-er-10 
KAM  10/3/02 




Implementing Psi Calculus with Expression Templates

Example:

A=take(4,drop(3,rev(B)))

B=<1 2 3 4 5 6 7 8 9 10>
A=<7 6 5 4>
Implementing Psi Calculus with Expression Templates

Example: A=take(4,drop(3,rev(B)))

B=<1 2 3 4 5 6 7 8 9 10>
A=<7 6 5 4>

Recall: Psi Reduction for 1-d arrays always yields one or more expressions of the form:

x[i]=y[stride*i + offset],  l ≤ i < u

1. Form expression tree

take
  4
  drop
    3
    rev
      B

2. Add size information

take   size=4
drop   size=7
rev    size=10
B      size=10

3. Apply Psi Reduction rules

take   size=4    A[i]=B[-i+6]
drop   size=7    A[i]=B[-(i+3)+9]=B[-i+6]
rev    size=10   A[i]=B[-i+B.size-1]=B[-i+9]
B      size=10   A[i]=B[i]

4. Rewrite as sub-expressions with iterators at the leaves, and loop bounds information at the root

• Iterators used for efficiency, rather than recalculating indices for each i

• One "for" loop to evaluate each sub-expression

Experiments

Results

• Loop implementation achieves good performance, but is problem specific and low level

• Traditional C++ operator implementation is general and high level, but performs poorly when composing many operations

• PETE/Psi array operators perform almost as well as the loop implementation, compose well, are general, and are high level

[Figure: Execution Time Normalized to Loop Implementation (axis from 0 to 40)]

Test ability to compose operations
Experimental Platform and Method

Hardware

• DY4 CHAMP-AV Board

- Contains 4 MPC7400’s and 1 MPC 8420

• MPC7400 (G4)

- 450 MHz

- 32 KB L1 data cache

- 2 MB L2 cache

- 64 MB memory/processor

Software

• VxWorks 5.2

- Real-time OS

• GCC 2.95.4 (non-official release)

- GCC 2.95.3 with patches for VxWorks

- Optimization flags: -O3 -funroll-loops -fstrict-aliasing

Method

• Run many iterations, report average, minimum, maximum time

- From 10,000,000 iterations for small data sizes, to 1000 for large data sizes

• All approaches run on same data

• Only average times shown here

• Only one G4 processor used

• Use of the VxWorks OS resulted in very low variability in timing

• High degree of confidence in results
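The measurement method (run many iterations, report average, minimum, and maximum time) can be sketched as a small harness. This is an assumption-laden illustration: it uses std::chrono, which the original GCC 2.95 and VxWorks toolchain did not provide, and the names TimingStats and time_iterations are hypothetical.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>

// Average, minimum, and maximum time per iteration, in seconds.
struct TimingStats {
    double avg_s;
    double min_s;
    double max_s;
};

// Runs body() the requested number of times and times each run separately.
template <class F>
TimingStats time_iterations(F&& body, std::size_t iterations) {
    using clock = std::chrono::steady_clock;
    double total = 0.0, lo = 0.0, hi = 0.0;
    for (std::size_t it = 0; it < iterations; ++it) {
        const auto t0 = clock::now();
        body();
        const std::chrono::duration<double> dt = clock::now() - t0;
        const double s = dt.count();
        total += s;
        lo = (it == 0) ? s : std::min(lo, s);
        hi = (it == 0) ? s : std::max(hi, s);
    }
    const double avg = iterations ? total / static_cast<double>(iterations) : 0.0;
    return {avg, lo, hi};
}
```

Timing every iteration individually adds clock overhead for very small workloads. Timing a whole batch and dividing is a common alternative, at the cost of losing the per-iteration minimum and maximum.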
Experiment 1: A=rev(B)

[Plot: throughput (elements/sec)]

• PETE/Psi implementation performs nearly as well as hand coded loop, and much better than regular C++ implementation

• Some overhead associated with expression tree manipulation
Experiment 2: a=rev(take(N,drop(M,rev(b))))

[Plot: throughput (elements/sec) versus vector size (log 2)]

• Larger gap between regular C++ performance and performance of other implementations: regular C++ operators do not compose efficiently

• Larger overhead associated with expression-tree manipulation due to more complex expression
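For reference, a hand-coded loop for this expression might look like the sketch below; the loop code actually benchmarked in the presentation is not shown on the slides. Writing S for b.size(), the composition reduces as rev(b)[i]=b[S-1-i], then drop(M,...)[i]=b[S-1-M-i], then take(N,...) keeps the first N of those, and the outer rev gives a[i]=b[S-M-N+i]. The function name experiment2_loop and the precondition N+M ≤ S are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Fully reduced form of a = rev(take(N, drop(M, rev(b)))):
// a[i] = b[(S - M - N) + i], where S = b.size(). Assumes N + M <= S.
std::vector<int> experiment2_loop(const std::vector<int>& b,
                                  std::size_t N, std::size_t M) {
    const std::size_t offset = b.size() - M - N;
    std::vector<int> a(N);
    for (std::size_t i = 0; i < N; ++i) a[i] = b[offset + i];
    return a;
}
```

With b=<1 2 ... 10>, N=4, and M=3, the inner take(4,drop(3,rev(b))) is <7 6 5 4>, so the result is <4 5 6 7>, produced here by a single stride-1 copy.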
Experiment 3: a=cat(b+c, d+e)

[Plot: throughput (elements/sec) versus vector size (log 2)]

• Still larger overhead associated with tree manipulation due to cat()

• Overhead can be mitigated by "setup" step prior to assignment
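A hand-coded loop version of this expression, as a sketch: each half of the result is written directly into its final position, so neither b+c nor d+e is materialized as a separate vector. The function name experiment3_loop and the size preconditions are assumptions, and the benchmarked code itself is not shown on the slides.

```cpp
#include <cstddef>
#include <vector>

// a = cat(b + c, d + e) without temporaries.
// Assumes b.size() == c.size() and d.size() == e.size().
std::vector<double> experiment3_loop(const std::vector<double>& b,
                                     const std::vector<double>& c,
                                     const std::vector<double>& d,
                                     const std::vector<double>& e) {
    std::vector<double> a(b.size() + d.size());
    // First sub-expression: b + c fills the front of a.
    for (std::size_t i = 0; i < b.size(); ++i) a[i] = b[i] + c[i];
    // Second sub-expression: d + e fills the rest, offset by b.size().
    for (std::size_t j = 0; j < d.size(); ++j) a[b.size() + j] = d[j] + e[j];
    return a;
}
```

This mirrors the "one for loop per sub-expression" structure described for the PETE/Psi implementation, with the cat() reduced to an offset into the destination.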
Future Work

• Multiple Dimensions: Extend this work to N-dimensional arrays (N is any non-negative integer)

• Parallelism: Explore dimension lifting to exploit multiple processors

• Memory Hierarchy: Explore dimension lifting to exploit levels of memory

• Mechanize Index Decomposition: Currently a time-consuming process done by hand

• Program Block Optimizations: PETE-style optimizations across statements to eliminate unnecessary temporaries
Conclusions

• Psi calculus provides rules to reduce array expressions to the minimum number of reads and writes

• Expression templates provide the ability to perform compiler preprocessor-style optimizations (expression tree manipulation)

• Combining Psi calculus with expression templates results in array operators that

- Compose efficiently

- Are high performance

- Are high level

• The C++ template mechanism can be applied to a wide variety of problems (e.g. tree traversal à la PETE, graph traversal, list traversal) to gain run-time speedup at the expense of compile time/space